Hello, PyTorch

This notebook will teach you to use PyTorch's low-level core. You can install it here. For the high-level interface, see the next notebook.

PyTorch feels different from TensorFlow/Theano on almost every level. TensorFlow makes your code live in two "worlds" simultaneously: symbolic graphs and actual tensors. First you declare a symbolic "recipe" of how to get from inputs to outputs, then feed it with actual minibatches of data. In PyTorch, there's only one world: all tensors have a numeric value.

You compute outputs on the fly without pre-declaring anything. The code looks almost exactly like pure numpy, with one exception: PyTorch computes gradients for you. And it can run stuff on a GPU. And it has a number of pre-implemented building blocks for your neural nets. And a few more things.

And now we finally shut up and let PyTorch do the talking.


In [ ]:
import os
import sys

if 'google.colab' in sys.modules and not os.path.exists('.setup_complete'):
    !wget -q https://raw.githubusercontent.com/yandexdataschool/Practical_RL/spring20/week04_%5Brecap%5D_deep_learning/notmnist.py

    !touch .setup_complete

In [ ]:
import numpy as np
import torch
print(torch.__version__)


1.4.1

In [ ]:
# numpy world

x = np.arange(16).reshape(4, 4)

print("X :\n%s\n" % x)
print("X.shape : %s\n" % (x.shape,))
print("add 5 :\n%s\n" % (x + 5))
print("X*X^T  :\n%s\n" % np.dot(x, x.T))
print("mean over cols :\n%s\n" % (x.mean(axis=-1)))
print("cumsum of cols :\n%s\n" % (np.cumsum(x, axis=0)))


X :
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]
 [12 13 14 15]]

X.shape : (4, 4)

add 5 :
[[ 5  6  7  8]
 [ 9 10 11 12]
 [13 14 15 16]
 [17 18 19 20]]

X*X^T  :
[[ 14  38  62  86]
 [ 38 126 214 302]
 [ 62 214 366 518]
 [ 86 302 518 734]]

mean over cols :
[ 1.5  5.5  9.5 13.5]

cumsum of cols :
[[ 0  1  2  3]
 [ 4  6  8 10]
 [12 15 18 21]
 [24 28 32 36]]


In [ ]:
# PyTorch world

x = np.arange(16).reshape(4, 4)

x = torch.tensor(x, dtype=torch.float32)  # or torch.arange(0,16).view(4,4)

print("X :\n%s" % x)
print("X.shape : %s\n" % (x.shape,))
print("add 5 :\n%s" % (x + 5))
print("X*X^T  :\n%s" % torch.matmul(x, x.transpose(1, 0)))  # short: x.mm(x.t())
print("mean over cols :\n%s" % torch.mean(x, dim=-1))
print("cumsum of cols :\n%s" % torch.cumsum(x, dim=0))


X :
tensor([[ 0.,  1.,  2.,  3.],
        [ 4.,  5.,  6.,  7.],
        [ 8.,  9., 10., 11.],
        [12., 13., 14., 15.]])
X.shape : torch.Size([4, 4])

add 5 :
tensor([[ 5.,  6.,  7.,  8.],
        [ 9., 10., 11., 12.],
        [13., 14., 15., 16.],
        [17., 18., 19., 20.]])
X*X^T  :
tensor([[ 14.,  38.,  62.,  86.],
        [ 38., 126., 214., 302.],
        [ 62., 214., 366., 518.],
        [ 86., 302., 518., 734.]])
mean over cols :
tensor([ 1.5000,  5.5000,  9.5000, 13.5000])
cumsum of cols :
tensor([[ 0.,  1.,  2.,  3.],
        [ 4.,  6.,  8., 10.],
        [12., 15., 18., 21.],
        [24., 28., 32., 36.]])

NumPy and PyTorch

As you can notice, PyTorch allows you to hack stuff much the same way you did with numpy. No graph declaration, no placeholders, no sessions. This means that you can see the numeric value of any tensor at any moment of time. Debugging such code can be done by printing tensors or with any debug tool you want (e.g. pdb).

You could also notice a few new method names and a different API. So no, there's no compatibility with numpy yet, and yes, you'll have to memorize all the names again. Get excited!

For example,

  • If something takes a list/tuple of axes in numpy, you can expect it to take *args in PyTorch
    • x.reshape([1,2,8]) -> x.view(1,2,8)
  • You should swap axis for dim in operations like mean or cumsum
    • x.sum(axis=-1) -> x.sum(dim=-1)
  • most mathematical operations are the same, but types and shaping are different (all three conversions are shown in the snippet after this list)
    • x.astype('int64') -> x.type(torch.LongTensor)
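Here are those three conversions as a runnable snippet (a minimal sketch; the variable names are just for illustration):

    import numpy as np
    import torch

    x_np = np.arange(16).reshape(4, 4).astype('float32')
    x_pt = torch.tensor(x_np)

    # reshape: a list of axes in numpy vs *args in torch
    assert np.allclose(x_np.reshape([1, 2, 8]), x_pt.view(1, 2, 8).numpy())

    # reductions: axis= in numpy vs dim= in torch
    assert np.allclose(x_np.sum(axis=-1), x_pt.sum(dim=-1).numpy())

    # dtype casts: astype in numpy vs .type (or .to) in torch
    assert x_pt.type(torch.LongTensor).dtype == torch.int64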

To help you acclimatize, there's a table covering most new things. There's also a neat documentation page.

Finally, if you're stuck with a technical problem, we recommend searching the PyTorch forums. Or just googling, which usually works just as well.

If you feel like giving up, remember two things: GPU and free gradients. Besides, you can always jump back to numpy with x.numpy().

Warmup: trigonometric knotwork

inspired by this post

There are some simple mathematical functions with cool plots. For one, consider this:

$$ x(t) = t - 1.5 \cdot \cos(15 t) $$

$$ y(t) = t - 1.5 \cdot \sin(16 t) $$
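As a hint for the API only (this is an illustrative curve, not the exercise formula): torch.cos and torch.sin work elementwise on whole tensors, just like their numpy counterparts, so a parametric plot needs no loops.

    import torch
    import matplotlib.pyplot as plt

    s = torch.linspace(0, 6.28, steps=1000)  # illustrative parameter range
    u = torch.cos(3 * s)                     # elementwise cosine
    v = torch.sin(5 * s)                     # elementwise sine
    plt.plot(u.numpy(), v.numpy())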

In [ ]:
import matplotlib.pyplot as plt
%matplotlib inline

t = torch.linspace(-10, 10, steps=10000)

# compute x(t) and y(t) as defined above
x = <YOUR CODE>
y = <YOUR CODE>

plt.plot(x.numpy(), y.numpy())

If you're done early, try adjusting the formula and seeing how it affects the function.










Automatic gradients

Any self-respecting DL framework must do your backprop for you. Torch handles this with the autograd module.

The general pipeline looks like this:

  • When creating a tensor, you mark it as requires_grad:
    • torch.zeros(5, requires_grad=True)
    • torch.tensor(np.arange(5), dtype=torch.float32, requires_grad=True)
  • Define some differentiable loss = arbitrary_function(a)
  • Call loss.backward()
  • Gradients are now available as a.grad (see the minimal sketch below)
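Before the full regression example, here is that pipeline on a single tensor (a minimal sketch):

    import torch

    a = torch.tensor([2.0, 3.0], requires_grad=True)  # mark the leaf tensor
    loss = (a ** 2).sum()                             # loss = a_0^2 + a_1^2
    loss.backward()                                   # compute d(loss)/d(a)
    print(a.grad)                                     # tensor([4., 6.]), i.e. 2*a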

Here's an example: let's fit a linear regression on Boston house prices


In [ ]:
from sklearn.datasets import load_boston
boston = load_boston()
plt.scatter(boston.data[:, -1], boston.target)


Out[ ]:
<matplotlib.collections.PathCollection at 0x7fb9a5c1d898>

In [ ]:
w = torch.zeros(1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)

x = torch.tensor(boston.data[:, -1] / 10, dtype=torch.float32)
y = torch.tensor(boston.target, dtype=torch.float32)

In [ ]:
y_pred = w * x + b
loss = torch.mean((y_pred - y)**2)

# propagate gradients
loss.backward()

The gradients are now stored in .grad of those variables that require them.


In [ ]:
print("dL/dw = \n", w.grad)
print("dL/db = \n", b.grad)


dL/dw = 
 tensor([-47.3514])
dL/db = 
 tensor([-45.0656])

If you compute gradients from multiple losses, they will add up in the variables' .grad fields, so it's useful to zero the gradients between iterations (see the sketch below).
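A minimal sketch of this accumulation behaviour (illustrative variable name):

    import torch

    a = torch.ones(1, requires_grad=True)
    (2 * a).sum().backward()
    print(a.grad)      # tensor([2.])
    (2 * a).sum().backward()
    print(a.grad)      # tensor([4.]), the second backward added to the first
    a.grad.zero_()     # reset before the next iteration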


In [ ]:
from IPython.display import clear_output

for i in range(100):

    y_pred = w * x + b
    loss = torch.mean((y_pred - y)**2)
    loss.backward()

    w.data -= 0.05 * w.grad.data
    b.data -= 0.05 * b.grad.data

    # zero gradients
    w.grad.data.zero_()
    b.grad.data.zero_()

    # the rest of the code is just bells and whistles
    if (i+1) % 5 == 0:
        clear_output(True)
        plt.scatter(x.data.numpy(), y.data.numpy())
        plt.scatter(x.data.numpy(), y_pred.data.numpy(),
                    color='orange', linewidth=5)
        plt.show()

        print("loss = ", loss.data.numpy())
        if loss.data.numpy() < 0.5:
            print("Done!")
            break


loss =  44.59417

Bonus quest: try implementing some nonlinear regression. You can use quadratic features, some trigonometry, or a simple neural network. The only difference is that now you have more variables and a more complicated y_pred (see the sketch below).
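For instance, a quadratic model could look roughly like this (a minimal sketch reusing x and y from above; the learning rate is arbitrary):

    w2 = torch.zeros(1, requires_grad=True)
    w1 = torch.zeros(1, requires_grad=True)
    b = torch.zeros(1, requires_grad=True)

    for i in range(100):
        y_pred = w2 * x ** 2 + w1 * x + b
        loss = torch.mean((y_pred - y) ** 2)
        loss.backward()

        for p in (w2, w1, b):
            p.data -= 0.05 * p.grad.data
            p.grad.data.zero_()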

High-level PyTorch

So far we've been dealing with the low-level torch API. While it's absolutely vital for any custom losses or layers, building large neural nets with it is a bit clumsy.

Luckily, there's also a high-level torch interface with pre-defined layers, activations and training algorithms.

We'll cover them as we go through a simple image recognition problem: classifying letters into "A" vs "B".


In [ ]:
from notmnist import load_notmnist
X_train, y_train, X_test, y_test = load_notmnist(letters='AB')
X_train, X_test = X_train.reshape([-1, 784]), X_test.reshape([-1, 784])

print("Train size = %i, test_size = %i" % (len(X_train), len(X_test)))


Parsing...
/home/jheuristic/anaconda3/lib/python3.6/site-packages/scipy/misc/pilutil.py:482: FutureWarning: Conversion of the second argument of issubdtype from `int` to `np.signedinteger` is deprecated. In future, it will be treated as `np.int64 == np.dtype(int).type`.
  if issubdtype(ts, int):
/home/jheuristic/anaconda3/lib/python3.6/site-packages/scipy/misc/pilutil.py:485: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
  elif issubdtype(type(size), float):
found broken img: ./notMNIST_small/A/RGVtb2NyYXRpY2FCb2xkT2xkc3R5bGUgQm9sZC50dGY=.png [it's ok if <10 images are broken]
Done
Train size = 2808, test_size = 937

In [ ]:
for i in [0, 1]:
    plt.subplot(1, 2, i + 1)
    plt.imshow(X_train[i].reshape([28, 28]))
    plt.title(str(y_train[i]))


Let's start with layers. The main abstraction here is torch.nn.Module


In [ ]:
from torch import nn
import torch.nn.functional as F

print(nn.Module.__doc__)


Base class for all neural network modules.

    Your models should also subclass this class.

    Modules can also contain other Modules, allowing to nest them in
    a tree structure. You can assign the submodules as regular attributes::

        import torch.nn as nn
        import torch.nn.functional as F

        class Model(nn.Module):
            def __init__(self):
                super(Model, self).__init__()
                self.conv1 = nn.Conv2d(1, 20, 5)
                self.conv2 = nn.Conv2d(20, 20, 5)

            def forward(self, x):
               x = F.relu(self.conv1(x))
               return F.relu(self.conv2(x))

    Submodules assigned in this way will be registered, and will have their
    parameters converted too when you call `.cuda()`, etc.
    

There's a vast library of popular layers and architectures already built for ya'.

This is a binary classification problem, so we'll train a Logistic Regression with sigmoid. $$P(y_i | X_i) = \sigma(W \cdot X_i + b) ={ 1 \over {1+e^{- [W \cdot X_i + b]}} }$$


In [ ]:
# create a network that stacks layers on top of each other
model = nn.Sequential()

# add first "dense" layer with 784 input units and 1 output unit.
model.add_module('l1', nn.Linear(784, 1))

# add sigmoid activation to turn the output into a probability
# note: layer names must be unique
model.add_module('l2', nn.Sigmoid())

In [ ]:
print("Weight shapes:", [w.shape for w in model.parameters()])


Weight shapes: [torch.Size([1, 784]), torch.Size([1])]

In [ ]:
# create dummy data with 3 samples and 784 features
x = torch.tensor(X_train[:3], dtype=torch.float32)
y = torch.tensor(y_train[:3], dtype=torch.float32)

# compute outputs given inputs; both are tensors
y_predicted = model(x)[:, 0]

y_predicted  # display what we've got


Out[ ]:
tensor([ 0.4526,  0.4411,  0.5917])

Let's now define a loss function for our model.

The natural choice is to use binary crossentropy (aka logloss, negative log-likelihood): $$ L = -\frac{1}{N} \sum_{X_i, y_i} \left[\, y_i \cdot \log P(y_i=1 \mid X_i) + (1 - y_i) \cdot \log \big(1 - P(y_i=1 \mid X_i)\big) \right] $$


In [ ]:
crossentropy = <YOUR CODE>

loss = <YOUR CODE>

assert tuple(crossentropy.size()) == (3,), "Crossentropy must be a vector with one element per sample"
assert tuple(loss.size()) == tuple(), "Loss must be scalar. Did you forget the mean/sum?"
assert loss.data.numpy() > 0, "Crossentropy must be non-negative, zero only for perfect prediction"
assert loss.data.numpy() <= np.log(3), "Loss is too large even for an untrained model. Please double-check it."

Note: you can also find many such functions in torch.nn.functional, just type F.<tab>.
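For instance, the mean logloss above is available out of the box; a minimal sketch (it averages over samples by default):

    import torch.nn.functional as F

    loss = F.binary_cross_entropy(y_predicted, y)  # same mean logloss, computed for you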

Torch optimizers

When we trained Linear Regression above, we had to manually .zero_() gradients on both our variables. Imagine that code for a 50-layer network.

Again, to keep your code from getting messy, there's the torch.optim module with pre-implemented optimization algorithms:


In [ ]:
opt = torch.optim.RMSprop(model.parameters(), lr=0.01)

# here's how it's used:
loss.backward()      # add new gradients
opt.step()           # change weights
opt.zero_grad()      # clear gradients

In [ ]:
# dispose of old variables to avoid bugs later
del x, y, y_predicted, loss, y_pred

Putting it all together


In [ ]:
# create network again just in case
model = nn.Sequential()
model.add_module('first', nn.Linear(784, 1))
model.add_module('second', nn.Sigmoid())

opt = torch.optim.Adam(model.parameters(), lr=1e-3)

In [ ]:
history = []

for i in range(100):

    # sample 256 random images
    ix = np.random.randint(0, len(X_train), 256)
    x_batch = torch.tensor(X_train[ix], dtype=torch.float32)
    y_batch = torch.tensor(y_train[ix], dtype=torch.float32)

    # predict probabilities
    y_predicted = <YOUR CODE>

    assert y_predicted.dim() == 1, "did you forget to select first column with [:, 0]?"

    # compute loss, just like before
    loss = <YOUR CODE>

    # compute gradients
    <YOUR CODE>

    # Adam step
    <YOUR CODE>

    # clear gradients
    <YOUR CODE>

    history.append(loss.data.numpy())

    if i % 10 == 0:
        print("step #%i | mean loss = %.3f" % (i, np.mean(history[-10:])))


step #0 | mean loss = 0.573
step #10 | mean loss = 0.371
step #20 | mean loss = 0.218
step #30 | mean loss = 0.159
step #40 | mean loss = 0.141
step #50 | mean loss = 0.127
step #60 | mean loss = 0.131
step #70 | mean loss = 0.107
step #80 | mean loss = 0.116
step #90 | mean loss = 0.101

Debugging tips:

  • make sure your model predicts probabilities correctly. Just print them and see what's inside.
  • don't forget the minus sign in the loss function! It's a mistake 99% of people make at some point.
  • make sure you zero-out gradients after each step. Srsly:)
  • In general, PyTorch's error messages are quite helpful, read 'em before you google 'em.
  • if you see nan/inf, print what happens at each iteration to find out where exactly it occurs.
    • If the loss goes down and then turns nan midway through, try a smaller learning rate. (Our current loss formula is numerically unstable; see the sketch below.)
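For example, one way to make the loss more stable is to clamp the predicted probabilities away from 0 and 1 before taking the log (a minimal sketch, reusing names from the training loop above):

    eps = 1e-6
    p = torch.clamp(y_predicted, eps, 1 - eps)  # keep log() away from -inf
    loss = torch.mean(-(y_batch * torch.log(p) + (1 - y_batch) * torch.log(1 - p)))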

Evaluation

Let's see how our model performs on test data


In [ ]:
# use your model to predict classes (0 or 1) for all test samples
predicted_y_test = <YOUR CODE>

assert isinstance(predicted_y_test, np.ndarray), "please return np array, not %s" % type(predicted_y_test)
assert predicted_y_test.shape == y_test.shape, "please predict one class for each test sample"
assert np.in1d(predicted_y_test, y_test).all(), "please predict class indexes"

accuracy = np.mean(predicted_y_test == y_test)

print("Test accuracy: %.5f" % accuracy)
assert accuracy > 0.95, "try training longer"


Test accuracy: 0.96585

More about PyTorch:

  • Using torch on GPU and multi-GPU - link
  • More tutorials on PyTorch - link
  • PyTorch examples - a repo that implements many cool DL models in PyTorch - link
  • Practical PyTorch - a repo that implements some... other cool DL models... yes, in PyTorch - link
  • And some more - link





Homework tasks

There will be three tasks worth 2, 3 and 5 points respectively. If you get stuck with no progress, try switching to the next task and returning later.

Task I (2 points) - tensormancy

When dealing with more complex stuff like neural networks, it's best if you use tensors the way a samurai uses his sword.

1.1 the cannabola disclaimer

Let's write another function, this time in polar coordinates: $$\rho(\theta) = (1 + 0.9 \cdot \cos(8\theta)) \cdot (1 + 0.1 \cdot \cos(24\theta)) \cdot (0.9 + 0.05 \cdot \cos(200\theta)) \cdot (1 + \sin(\theta))$$

Then convert it into cartesian coordinates, $x = \rho \cdot \cos(\theta)$ and $y = \rho \cdot \sin(\theta)$ (howto), and plot the results.

Use torch tensors only: no lists, loops, numpy arrays, etc.


In [ ]:
theta = torch.linspace(- np.pi, np.pi, steps=1000)

# compute rho(theta) as per formula above
rho = <YOUR CODE>

# Now convert polar (rho, theta) pairs into cartesian (x,y) to plot them.
x = <YOUR CODE>
y = <YOUR CODE>


plt.figure(figsize=[6, 6])
plt.fill(x.numpy(), y.numpy(), color='green')
plt.grid()

Task II: the game of life (3 points)

Now it's time for you to make something more challenging. We'll implement Conway's Game of Life in pure PyTorch.

While this is still a toy task, implementing the game of life this way has one cool benefit: you'll be able to run it on a GPU! Indeed, what could be a better use of your GPU than simulating the game of life on million-by-million grids?

If you've skipped the url above out of sloth, here's the game of life:

  • You have a 2D grid of cells, where each cell is "alive"(1) or "dead"(0)
  • Any living cell that has 2 or 3 neighbors survives, else it dies [0,1 or 4+ neighbors]
  • Any cell with exactly 3 neighbors becomes alive (if it was dead)

For this task, you are given a reference numpy implementation that you must convert to PyTorch. [numpy code inspired by: https://github.com/rougier/numpy-100]

Note: You can find convolution in torch.nn.functional.conv2d(Z, filters). Note that it expects a different input format (see the sketch after these notes).

Note 2: From the mathematical standpoint, PyTorch convolution is actually cross-correlation. Those two are very similar operations. More info: video tutorial, scipy functions review, stack overflow source.
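Here is a minimal sketch of that input format, assuming Z is already a float tensor of shape [height, width] (only the neighbour count, not the full update rule):

    import torch
    import torch.nn.functional as F

    # F.conv2d wants 4D tensors: [batch, channels, height, width]
    Z4d = Z.view(1, 1, *Z.shape)                 # [1, 1, H, W]
    kernel = torch.tensor([[1., 1., 1.],
                           [1., 0., 1.],
                           [1., 1., 1.]]).view(1, 1, 3, 3)

    N = F.conv2d(Z4d, kernel, padding=1)[0, 0]   # neighbour counts, back to [H, W]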


In [ ]:
from scipy.signal import correlate2d

def np_update(Z):
    # Count neighbours with convolution
    filters = np.array([[1, 1, 1],
                        [1, 0, 1],
                        [1, 1, 1]])

    N = correlate2d(Z, filters, mode='same')

    # Apply rules
    birth = (N == 3) & (Z == 0)
    survive = ((N == 2) | (N == 3)) & (Z == 1)

    Z[:] = birth | survive
    return Z

In [ ]:
def torch_update(Z):
    """
    Implement an update function that does to Z exactly the same as np_update.
    :param Z: torch.FloatTensor of shape [height, width] containing 0s (dead) and 1s (alive)
    :returns: torch.FloatTensor Z after updates.

    You can opt to create new tensor or change Z inplace.
    """

    # <Your code here!>

    return Z

In [ ]:
# initial frame
Z_numpy = np.random.choice([0, 1], p=(0.5, 0.5), size=(100, 100))
Z = torch.from_numpy(Z_numpy).type(torch.FloatTensor)

# your debug playground :)
Z_new = torch_update(Z.clone())

# tests
Z_reference = np_update(Z_numpy.copy())
assert np.all(Z_new.numpy() == Z_reference), \
    "your PyTorch implementation doesn't match np_update. Look into Z_new and Z_reference to investigate."
print("Well done!")

In [ ]:
%matplotlib notebook
plt.ion()

# initialize game field
Z = np.random.choice([0, 1], size=(100, 100))
Z = torch.from_numpy(Z).type(torch.FloatTensor)

fig = plt.figure()
ax = fig.add_subplot(111)
fig.show()

for _ in range(100):

    # update
    Z = torch_update(Z)

    # re-draw image
    ax.clear()
    ax.imshow(Z.numpy(), cmap='gray')
    fig.canvas.draw()

In [ ]:
# Some fun setups for your amusement

# parallel stripes
Z = np.arange(100) % 2 + np.zeros([100, 100])
# with a small imperfection
Z[48:52, 50] = 1

Z = torch.from_numpy(Z).type(torch.FloatTensor)

fig = plt.figure()
ax = fig.add_subplot(111)
fig.show()

for _ in range(100):
    Z = torch_update(Z)
    ax.clear()
    ax.imshow(Z.numpy(), cmap='gray')
    fig.canvas.draw()

More fun with Game of Life: video






Task III: Going deeper (5 points)

Your ultimate task for this week is to build your first neural network [almost] from scratch, in pure torch.

This time you will solve the same letter recognition problem, but at a greater scale:

  • 10 different letters
  • 20k samples

We want you to build a network that reaches at least 80% accuracy and has at least 2 linear layers in it. Naturally, it should be nonlinear to beat logistic regression. You can implement it with either nn.Sequential or your own nn.Module subclass.

With 10 classes you will need to use softmax at the top instead of sigmoid and train with categorical crossentropy (see here). Write your own loss or use torch.nn.functional.nll_loss; just make sure you understand what it accepts as input (a minimal sketch follows below).
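For reference, the usual pattern looks roughly like this (a minimal sketch; logits stands for the output of your last linear layer before any activation, and y_batch must be a LongTensor of class indices; both names are placeholders):

    import torch.nn.functional as F

    log_probs = F.log_softmax(logits, dim=1)  # [batch, 10] log-probabilities
    loss = F.nll_loss(log_probs, y_batch)     # mean categorical crossentropy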

Note that you are not required to build 152-layer monsters here. A 2-layer (one hidden, one output) neural network should already give you an edge over logistic regression.

[bonus kudos] If you've already beaten logistic regression with a two-layer net, but enthusiasm still ain't gone, you can try improving the test accuracy even further! It should be possible to reach 90% without convnets.

SPOILERS! At the end of the notebook you will find a few tips and frequent errors. If you feel confident enough, just start coding right away and come back to them only if you need to untangle yourself.


In [ ]:
from notmnist import load_notmnist
X_train, y_train, X_test, y_test = load_notmnist(letters='ABCDEFGHIJ')
X_train, X_test = X_train.reshape([-1, 784]), X_test.reshape([-1, 784])


Parsing...
found broken img: ./notMNIST_small/F/Q3Jvc3NvdmVyIEJvbGRPYmxpcXVlLnR0Zg==.png [it's ok if <10 images are broken]
found broken img: ./notMNIST_small/A/RGVtb2NyYXRpY2FCb2xkT2xkc3R5bGUgQm9sZC50dGY=.png [it's ok if <10 images are broken]

In [ ]:
%matplotlib inline
plt.figure(figsize=[12, 4])
for i in range(20):
    plt.subplot(2, 10, i+1)
    plt.imshow(X_train[i].reshape([28, 28]))
    plt.title(str(y_train[i]))



In [ ]:
<YOUR CODE: a whole lot of it>

In [ ]:
















SPOILERS!

Recommended pipeline

  • Adapt logistic regression from previous assignment to classify one letter against others (e.g. A vs the rest)
  • Generalize it to multiclass logistic regression.
    • Either try to remember lecture 0 or google it.
    • Instead of a weight vector you'll have to use a weight matrix of shape (feature_id x class_id)
    • softmax (exp over sum of exps) can be implemented manually, or via nn.Softmax (layer) or F.softmax (function)
    • probably better to use STOCHASTIC gradient descent (minibatch) for greater speed
      • you can also try momentum/rmsprop/adawhatever
      • in which case samples should probably be shuffled (or use random subsamples on each iteration)
  • Add a hidden layer. Now your logistic regression uses hidden neurons instead of inputs.

    • Hidden layer uses the same math as output layer (ex-logistic regression), but uses some nonlinearity (e.g. sigmoid) instead of softmax
    • You need to train both layers, not just output layer :)
    • 50 hidden neurons and a sigmoid nonlinearity will do for a start. Many ways to improve.
    • In the ideal case this totals 2 torch.matmul's, 1 softmax and 1 relu/sigmoid
    • make sure this neural network works better than logistic regression
  • Now's the time to try improving the network. Consider layers (size, neuron count), nonlinearities, optimization methods, initialization - whatever you want, but please avoid convolutions for now.

  • If anything seems wrong, try going through one step of training and printing everything you compute.

  • If you see NaNs midway through optimization, you can compute log P(y|x) directly via F.log_softmax(layer_before_softmax) for better numerical stability (see the sketch below).
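For example, a two-layer model along these lines could look like this (a minimal sketch; the hidden size and optimizer settings are arbitrary):

    import torch
    from torch import nn

    model = nn.Sequential(
        nn.Linear(784, 50),    # hidden layer: 784 inputs -> 50 hidden neurons
        nn.Sigmoid(),          # nonlinearity (ReLU also works)
        nn.Linear(50, 10),     # output layer: one unit per class
        nn.LogSoftmax(dim=1),  # log-probabilities, pairs with F.nll_loss
    )
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)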